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Any performance analysis based on stochastic simulation is subject to the errors inherent in misspecifying 
the modeling assumptions, particularly the input distributions. In situations with little support from data, 
we investigate the use of worst-case analysis to analyze these errors, by representing the partial, nonparamet- 
ric knowledge of the input models via optimization constraints. We study the performance and robustness 
guarantees of this approach. We design and analyze a numerical scheme for a general class of simulation 
objectives and uncertainty specifications, based on a Frank-Wolfe (FW) variant of the stochastic approxima¬ 
tion (SA) method (which we call FWSA) run on the space of input probability distributions. The key steps 
involve a randomized discretization of the probability spaces and the construction of simulable unbiased 
gradient estimators using a nonparametric analog of the classical likelihood ratio method. A convergence 
analysis for FWSA on non-convex problems is provided. We test the performance of our approach via several 
numerical examples. 


1. Introduction 

Simulation-based performance analysis of stochastic models, or what is usually known as stochas¬ 
tic simnlation, is built on input model assumptions that to some extent deviate from the truth. 
Consequently, the performance analysis subject to these input errors may lead to poor prediction 
and suboptimal decision-making. To address this important problem, the common study frame¬ 
work in the stochastic simulation literature focuses on output variability measures or confidence 
bonnds that account for the input uncertainty when input data are available. Some established sta¬ 


tistical techniques such as the bootstrap (e.g. Barton and Schrnben (1993), Barton et al. (2013)), 
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goodness-of-fit tests (e.g. Banks et al. (]2009)), Bayesian inference and model selection (e.g. Chick 


(2001), Zonaoni and Wilson (2004)) and the delta method (e.g., Cheng and Holland ( |1998 2004)) 
etc. have been proposed and have proven effective in many situations. 

In this paper, we focus on a setting that diverges from most past literature: we are primarily 
interested in situations with insufficient data, or when the modeler wants to assess risk beyond 
what the data or the model indicates. Such situations can arise when the system, service target 
or operational policy in study is at a testing stage without much prior experience. To find reliable 
estimates for the output quantities in these settings, we investigate a worst-case approach operating 
on the space of input models. In this framework, the modeler represents the partial and nonpara- 
metric beliefs about the input model as constraints, and computes tight worst-case bounds among 
all models that satisfy them. More precisely, let Z(P^,...,P™) be a performance measure that 
depends on m input models, each generated from a probability distribution Ph The formulation 
for computing the worst-case bounds are 


min Z{P\...,P'^) and max Z{P\...,P^) (1) 

The set W encodes the collection of all possible P* from the knowledge of the modeler. The decision 
variables in the optimizations in 0 are the unknown models P\i = I,..., m. 

The primary motivation for using ([^ is the robustness against model misspecification, where 
a proper construction of the set ZY* avoids making specific assumptions beyond the modeler’s 
knowledge. The following three examples motivate and explain further. 


Example I (Robust bounds under expert opinion). When little information is available for 
an input model, a common practice in stochastic simulation is to summarize its range (say [a,b]) 
and mean (say /r; or sometimes mode) as a triangular distribution (Banks et al. ( 2009[ ), Chapter 
5), where the base of the triangle denotes the range and the position of the peak is calibrated from 
the mean. This specific distribution, while easy to use, only crudely describes the knowledge of the 
modeler and may deviate from the true distribution, even if a,6,/i are correctly specified. Instead, 
using 

W = {P*: Epi [X*] = /i, supp P* = [a, b]} 


(2) 
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in formulation Q, where is the random variate, Epi[-] is the expectation under P*, and supp P* 
is the support of P*, will give a valid interval that covers the true performance measure whenever 
a,b,fi are correctly specified. Moreover, when these parameters are not fully known but instead 
specified within a range, Q can be relaxed to 

W = {P^ :[i<Epi [X*] < p, supp X* = [a, b]} 


where [/i,p] denotes the range of the mean and a,b denote the lower estimate of the lower support 
end and npper estimate of the upper support end respectively. The resulting bonnd will cover the 
trnth as long as these ranges are supplied correctly. □ 


Example 2 (Dependency modeling). In constructing dependent input models, co mm on 
approaches in the simulation literature fit the marginal description and the correlation of a mnlti- 


variate model to a specified family. Examples inclnde Ganssian copnla (e.g., Lnrie and Goldberg 


(1998), Ghannouf and L’Ecuyer (2009); also known as normal-to-anything (NORTA), e.g. Gario 


and Nelson (1997)) and chessboard distribution (Ghosh and Henderson (2002)) that uses a domain 


discretization. These distributions are correctly constructed up to their marginal description and 
correlation, provided that these information are correctly specified. However, dependency structure 
beyond correlation can imply errors on these approaches (e.g., Lam (2016c)). Formulation (Q can 
be used to get bounds that address snch dependency. For example, suppose P* is a bivariate input 
model with marginal distributions marginal means and covariance p*. We can 

set 


W = {P^ : Pp.i(W’i < = = Pp.,,{X^’^ < qf) = = 1,... ,1,, E[X^^^X^^^] = + 


where (X*’^, A*’^) denote the random vector under P\ and q^P, q^^'^, pP, pP are pre-specified quan¬ 
tiles and probabilities of the respective marginal distributions. Unlike previous approaches, 0 
outputs correct bounds on the truth given correctly specified marginal quantiles and correlation, 
regardless of the dependency structure. □ 
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Example 3 (Model risk). Model risk refers broadly to the uncertainty in analysis arising from 
the adopted model not being fully accurate. This inaccuracy occurs as the adopted model (often 
known as the baseline model), typically obtained from the best statistical fit or expert opinion, 
deviates from the truth due to the real-world non-stationarity and the lack of full modeling knowl¬ 
edge or capability. To assess model risk, a recently surging literature studies the use of statistical 


distance as a measurement of model discrepancy (e.g., Glasserman and Xu[(2014), Lam (2016b)). 


Given the baseline model P^, the idea is to represent the uncertainty in terms of the distance away 
from the baseline via a neighborhood ball 


W = {P^ :d{P\P^)<r]^} 


( 3 ) 


where d is a distance defined on the nonparametric space of distributions (i.e., withont restricting 
to any parametric families). The bounds drawn from formulation (Q assesses the effect of model 
risk due to the input models, tuned by the ball size parameter ry® that denotes the uncertainty level. 
Besides risk assessment, this approach can also be nsed to obtain consistent confidence bounds 
for the true performance measure, when P^ is taken as the empirical distribution and rj and d are 
chosen snitably. For example, when d is a (/>-divergence and the model is finite discrete, one can 
choose ry* as a scaled x^-Quantile. When d is the snp-norm and the model is continuous, then one 
can choose ry* as the quantile of a Kolmogorov-Smirnov statistics. The statistical properties of these 
choices are endowed from the associated goodness-of-fit tests (]Ben-Tal et al. (2013), Bertsimas 


et al. (2014)). □ 


Our worst-case approach is inspired from the literature of robust optimization (Ben-Tal et al. 


(2009), Bertsimas et al. (2011)), which considers decision-making under uncertainty and advocates 


optimizing decisions over worst-case scenarios. In particular, when the uncertainty lies in the prob¬ 
ability distributions that govern a stochastic problem, the decision is made to optimize under the 
worst-case distributions. This approach has been used widely in robust stochastic control (e.g. 


Hansen and Sargent (2008), Petersen et al. (2000)) and distributionally robust optimization (e.g. 
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Delage and Ye (2010), Lim et al. (2006)). It has also appeared in so-called robust simulation or 


robust Monte Carlo in the simulation literature (Hu et al. (2012), Glasserman and Xu (2014)). 


However, the methodologies presented in these literature focus on structured problems where the 
objective function is tractable, such as linear or linearly decomposable. In contrast, Z{-) for most 
problems in stochastic simulation is nonlinear, unstructured and is only amenable to simulation, 
obstructing the direct adaption of the existing methods. In view of this, our main objective is to 
design an efficient simulation-based method to compute the worst-case bounds for formulation 0 
that can be applied to broad classes of simulation models and input uncertainty representations. 


1.1. Our Contributions 


We study a simulation-based iterative procedure for the worst-case optimizations ([^, based on a 
modified version of the celebrated stochastic approximation (SA) method (e.g. [Kushner and Yiii 


(2003)). Because of the iterative nature, it is difhcult to directly operate on the space of continuous 
distributions except in very special cases. Thus, our first contribution is to provide a randomized 
discretization scheme that can provably approximate the continuous counterpart. This allows one 
to focus on the space of discrete distributions on fixed support points as the decision variable for 
feeding into our SA algorithm. 

We develop the SA method in several aspects. First, we construct an unbiased gradient estimator 
for Z based on the idea of the Gateaux derivative for functionals of probability distributions 


(Serfling (2009)), which is used to obtain the direction in each subsequent iterate in the SA scheme. 
The need for such construction is motivated by the difhculty in naive implementation of standard 
gradient estimators: An arbitrary perturbation of a probability distribution, which is the decision 
variable in the optimization, may shoot outside the probability simplex and resnlts in a gradient 
that does not bear any probabilistic meaning and subseqnently does not support simulation-based 
estimation. Our approach effectively restricts the direction of perturbation to within a probability 
simplex so that the gradient is gnaranteed to be simulable. We justify it as a nonparametric version 
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of the classical likelihood ratio method (also called the score function method) (Glynn (1990), 


Reiman and Weiss (1989), Rubinstein (1986)). 


Next, we design and analyze our SA scheme under the uncertainty constraints. We choose to 


use a stochastic counterpart of the so-called Frank-Wolfe (FW) method (Frank and Wolfe (1956)), 
known synonymously as the conditional gradient method in deterministic nonlinear programming. 
For convenience we call our scheme FWSA. Note that a standard SA iteration follows the estimated 
gradient up to a pre-specified step size to find the next candidate iterate. When the formulation 
includes constraints, the common approach in the SA literature projects the candidate solution 
onto the feasible region in order to dehne the next iterate (e.g. Kushner and Yin ( 2003| )). Instead, 
our method looks in advance for a feasible direction along which the next iterate is guaranteed to lie 
in the (convex) feasible region. In order to hnd this feasible direction, an optimization subproblem 
with a linear objective function is solved in each iteration. We base our choice of using FWSA 
on its computational benefit in solving these subproblems, as their linear objectives allow efficient 
solution scheme for high-dimensional decision variables for many choices of the set W. 

We characterize the convergence rate of FWSA in terms of the step size and the number of 
simulation replications used to estimate the gradient at each iteration. The form of our convergence 
bounds suggests prescriptions for the step-size and sample-size sequences that are efficient with 
respect to the cumulative number of sample paths simulated to generate all the gradients until the 
current iterate. The literature on the stochastic FW methods for non-convex problems is small. 


Kushner (1974) proves almost sure convergence under assumptions that can prescribe algorithmic 


specifications only for one-dimensional settings. During the review process of this paper, two other 


convergence rate studies Reddi et al. (2016) and Lafond et al. (2016) have appeared. Both of 
them assume the so-called G-Lipschitz condition on the gradient estimator that does not apply to 
our setting. Consequently, onr obtained convergence rates are generally inferior to their results. 
Nonetheless, we shall argne that onr rates almost match theirs under stronger assumptions on the 


behavior of the iterates that we will discuss. 
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Finally, we provide numerical validation of our approach using two sets of experiments, one 
testing the performance of our proposed randomized discretization strategy, and one on the con¬ 
vergence of FWSA. We also illustrate the impacts from the choices of constraints and investigate 
the form of the resulting worst-case input distributions in these examples. 


1.2. Literature Review 

We briefly survey three lines of related work. First, our paper is related to the large literature 
on input model uncertainty. In the parametric regime, studies have focused on the construction 
of conhdence intervals or variance decompositions to account for both parameter and stochastic 


uncertainty using data, via for instance the delta method (Cheng and Holland (1998, 2004)), the 


bootstrap (Barton et al. ( [2013 ), Cheng and Holloand ( 1997| )), Bayesian approaches (Zouaoui and 
Wilson ( 2003j ), Xie et al. (2014), Saltelli et al. (2010, 2008| )), and metamodel-assisted analysis (Xie 


et al. (2014, 2015)). Model selection beyond a single parametric model can be handled through 


goodness-of-ht or Bayesian model selection and averaging (Chick (2001), Zouaoui and Wilson 


(2004)). Fully nonparametric approaches using the bootstrap have also been investigated (Barton 


and Schruben (1993 [200l| ), |Song and Nelson (2015)). 

Second, formulation ([^ relates to the literature on robust stochastic control (Petersen et al 
(2000), Iyengar (2005[), |Nilim and El Ghaoui (2005)) and distributionally robust optimization 


(Delage and Ye (2010), Goh and Sim (2010), Ben-Tal et al. (2013)), where the focus is to make 
decision rules under stochastic environments that are robust against the ambiguity of the underlying 
probability distributions. This is usually cast in the form of a minimax problem where the inner 
maximization is over the space of distributions. This idea has spanned across multiple areas like 
economics (Hansen and Sargent (|2001 2008)), finance (Glasserman and Xu (2013), |Lim et ah 


(2011)), queueing (Jain et al. (2010)) and dynamic pricing (Lim and Shanthikumar (2007)). In the 


simulation context, Hu et al. (2012) compared different global warming policies using Gaussian 


models with uncertain mean and covariance information and coined the term “robust simulation” 


in their study. Glasserman and Xu (2014) and Lam (2016b) studied approximation methods for 
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distance-based constraints in model risk quantification, via sample average approximation and 
infinitesimal approximation respectively. Simulation optimization under inpnt uncertainty has also 


been stndied in the framework of robust optimization (Fan et al. (2013), Ryzhov et al. (2012)) 


and via the closely related approach using risk measure (Qian et al. (2015), Zhou and Xie (2015)). 
Lastly, optimizations over probability distributions have also arisen as so-called generalized moment 
problems, applied to decision analysis (jSmith (1995, 1993), Bertsimas and Popescu (2005)) and 


stochastic programming (Birge and Wets (1987)). 


Our algorithm relates to the literature on the FW method (Frank and Wolfe (1956)) and con¬ 
strained SA. The former is a nonlinear programming technique initially proposed for convex opti¬ 
mization, based on sequential linearization of the objective function using the gradient at the 
solution iterate. SA is the stochastic counterpart of gradient-based method under noisy observa¬ 
tions. The classical work of Canon and Cnllum (jl968), pdunn (1979) and Dunn (1980) analyzed 


convergence properties of FW for deterministic convex programs. Recently, Jaggi (2013), Freund 


and Grigas (2014) and Hazan and Lno (2016) carried ont new finite-time analysis for the FW 


method motivated by machine learning applications. For stochastic FW on non-convex problems. 


Kushner (1974) focused on almost sure convergence based on a set of assumptions about the prob¬ 


abilistic behavior of the iterations, which were then used to tune the algorithm for one-dimensional 


problems. Recently, while this paper was under review, Reddi et al. (2016) provided a complexity 
analysis in terms of the sample size in estimating gradients and the number of calls of the linear 


optimization routine. Lafond et al. (2016) studied the performance in terms of regret in an online 


setting. Both Reddi et al. (2016) and Lafond et al. (2016) relied on the G-Lipschitz condition that 
our gradient estimator violated. Other types of constrained SA schemes include the Lagrangian 


method (Buche and Kushner (2002)) and mirror descent SA (Nemirovski et al. (2009)). Lastly, gen¬ 


eral convergence results for SA can be found in Fn (1994), Kushner and Yin (2003) and Pasupathy 
and Kim ( 2011[ ). 
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1.3. Organization of the Paper 

The remainder of this paper is organized as follows. Section describes our formulation and 
assumptions. Sectionpresents our performance guarantees and discretization strategy. Section]^ 
focuses on gradient estimation on the space of probability distributions. Section introduces the 
FWSA procedure. Section presents theoretical convergence results on FWSA. Section shows 
some numerical performance. Section concludes and suggests future research. Finally, Appendix 


EC.l gives the remaining proofs of our theorems. 


2. Formulation and Assumptions 


We focus on ..., P™) that is a finite horizon performance measure generated from i.i.d. 

replications from the independent input models Let X* = be T* i.i.d. 

random variables on the space A* C M” , each generated under P\ The performance measure can 
be written as 

1 r^TTL 

Z(P\...,P-) = Ppy,..,p™[/i(X\...,X™)]= f ••• f hi^\...,^n]JdP{xl)---lldP{xT) (4) 

t^l t^l 

where h{-) : —)'M is a cost function, and Epi pm[-] denotes the expectation associated 

with the generation of the i.i.d. replications. We assume that h{-) can be evaluated by the computer 
given the inputs. In other words, the performance measure Q can be approximated by running 
simulation. 

Q is the stylized representation for transient performance measures in discrete-event simulation. 
For example, X^ and X^ can be the sequences of interarrival and service times in a queue, and P^ 
and P^ are the interarrival time and service time distributions. When /i(X^,X^) is the indicator 
function of the waiting time exceeding a threshold, Q will denote the expected waiting time. 

Next we discuss the constraints in 0. Following the terminology in robust optimization, we call 
the uncertainty set for the Tth input model. Motivated by the examples in the Introduction, 
we focus on two types of convex uncertainty sets: 
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1. Moment and support constraints: We consider 

U^ = {P^:Epi[f;{X^)]<^i]J = l,...,s\ supp P^ = A^} (5) 

where X* is a generic random variable under distribution P®, fl : d:”* —)> M, and ^4* C d:”*. For instance, 
when df* = M, //(x) being x or x^ denotes the first two moments. When d^® = /j®(xi,X 2 ) = X 1 X 2 

denotes the cross-moment. Equalities can also be represented via ®) by including Ppi [—//(X®)] < 
—fi}. Thus the uncertainty set ([^ covers Examples andin the Introduction. 

Furthermore, the neighborhood measured by certain types of statistical distance (Example 
can also be cast as (§. For instance, suppose d is induced by the sup-norm on the distribution 
function on M. Suppose P® is a continuous distribution and the baseline distribution P^® is discrete 
with support points j = 1,..., n®. The constraint 

sup|P®(x)-P6®(x)| <7?® (6) 

where P® and P^® denote the distribution functions for P® and P^® respectively, can be reformulated 
as 

F,®(y,+) - y® < P®(y,) < P®(y,-) + y®, / = 1,..., n® 

where F^{yj—) and Ff;{yj+) denote the left and right limits of P^® at by using the monotonicity 
of distribution functions. Thus 


W = {P®: P®(y,+) - y® < P®[/(X® < y,)] < P®(y,-) + y®, j = 1,..., n®, supp P® = ] 


where /(•) denotes the indicator function, falls into the form of Q. Bertsimas et al. (2014) con¬ 
siders this reformulation for constructing uncertainty sets for stochastic optimization problems, 
and suggests to select y® as the quantile of Kolmogorov-Smirnov statistics if P^® is the empirical 
distribution function constructed from i.i.d. data. 

2. Neighborhood of a baseline model measured by cf-divergence: Consider 


^/® = {P®:d4P®,P®)<y®} 


( 7 ) 
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where d^{P^,Pl) denotes the ^-divergence from a baseline distribution PI given by 

which is finite only when P* is absolutely continuous with respect to P^. The function 0 is a convex 
function satisfying = 0. This family covers many widely used distances. Common examples are 
(j){x) = xlogx — X -|- 1 giving the KL divergence, (j){x) = (x — 1)^ giving the (modified) x^-distance, 
and (l){x) = {I — 6 + Ox — x®)/(0(l — 9)), 0 7 ^ 0,1 giving the Cressie-Read divergence. Details of 


^-divergence can be found in, e.g., Pardo (2005) and Ben-Tal et al. (2013) 


As precursed in the Introduction, in the context of simulation analysis where (P^,...,P™) are 
the input models, Z{-) in (|^ is in general a complex nonlinear function. This raises challenges 
in solving ([^ beyond the literature of robust control and optimization that considers typically 
more tractable objectives. Indeed, if Z{-) is a linear function in P*’s, then optimizing over the two 
types of uncertainty sets above can both be cast as specialized classes of convex programs that 
can be efficiently solved. But linear Z{-) is too restrictive to describe the input-output relation 
in simulation. To handle a broader class of Z{-) and to address its simulation-based nature, we 
propose to use a stochastic iterative method. The next sections will discuss our methodology in 
relation to the performance guarantees provided by ([^. 

3. Performance Guarantees and Discretization Strategy 

Suppose there is a “ground true” distribution PJ for each input model. Let Z* and Z* be the 
minimum and maximum values of the worst-case optimizations Q- Let Zq be the true performance 
measure, i.e. Zq = Z{Pq, ..., PJ")- The following highlights an immediate implication of using (Q: 


Theorem 1. If PqGW for all i, then Z^< Zq< Z*. 


In other words, the bounds from the worst-case optimizations form an interval that covers the 
true performance measure if the uncertainty sets contain the true distributions. 

We discuss a discretization strategy for the worst-case optimizations for continuous input distri¬ 
butions. We will show that, by replacing the continuous distribution with a discrete distribution on 
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support points that are initially sampled from some suitably chosen distribution, we can recover 
the guarantee in Theoremup to a small error. The motivation for using discretization comes from 
the challenges in handling decision variables in the form of continuous distributions when running 
our iterative optimization scheme proposed later. 

We focus on the two uncertainty sets and Q. The following states our guarantee: 

Theorem 2. Consider Z{P^,...,P'^) in Q. Assume h is bounded a.s.. Suppose n\i = 1, ... ,m 
and n are positive numbers such that C_<n^jn<C for all i for some 0 < C,C < oo. Consider the 
optimizations 

Z*= min Z{P\...,P'^) and Z* = max Z{P\...,P^) (8) 

Suppose that for each i, one of the two cases occurs: 

1 . 

U^ = {P^■.Ep.[fi{X^)]<^l],l = l,...,s\ suppr = {y\,...,yl.}} (9) 

where {y \.,... are n® observations drawn from a distribution Qf such that the true distribution 
Pq is absolutely continuous with respect to Q® with U = dPQ/dQ^ satisfying ||T®||oo < oo, where 
II • Iloo denotes the essential supremum norm. Moreover, assume that Pq satisfies Epi[fl{X^)] < y,] 
and Epi [//(W)^] < oo for all I = 1,... ,s\ 

2 . 

w = {p^-.dAP\Pi,)<v^} ( 10 ) 

where PI denotes the empirical distribution drawn from n® observations from a baseline P^. 
Assume that L® = dP^/dP^ satisfies ||T®||oo < oo. Moreover, assume Pq satisfies dAPo-,Pb) < 
Epi[(f{UY] < oo. Additionally, assume 4>'{x) < oo for all x £ M+. 

Then 

Zi. ^ Zq A Op < Z* 


( 11 ) 
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Theoremis proved in Appendix |EC.l We have a few immediate remarks: 

1 . Optimizations Q are the sample counterparts of the original worst-case optimizations Q 
with uncertainty sets given by ([^ or 0 . These sample counterparts optimize discrete distributions 
over support points that are sampled from either Q* or depending on the type of uncertainty. 
Roughly speaking, Theorem guarantees that, if the original worst-case optimizations give valid 
covering bounds for the true performance measure (in the spirit of Theorem , then so are the 
sample counterparts, up to an error Op{l/^/n) where n denotes the order of the sample size used 
to construct the sets of support points. 

2. The condition ||T*||oo < oo implies that Q* or Pf; has a tail at least as heavy as PJ. In practice, 
the tail of the true distribution Pq is not exactly known a priori. This means that it is safer to 
sample the support points from a heavy-tailed distribution when moment constraints are imposed, 
and to use a heavy-tailed baseline distribution in the case of (^-divergence neighborhood. 

3. The conditions Epi [//(A*)] < and d^{PQ,P^) < rj'- state that Epi [//(A*)] and d^{PQ,Pl) are 
in the interior of {(^i,..., Zgi) : Zi< / = 1,..., s*} and {z : z < respectively. These conditions 
guarantee that Pq projected on a sample approximation of the support is feasible for Q, which 
helps lead to the guarantee ©• In general, the closer Pg is to the boundary of the uncertainty 
set, i.e., the smaller the values of — Epi[fl{X^)] and — d^{PQ,P^), the larger the sample size 
is needed for the asymptotic behavior in to kick in, a fact that is not revealed explicitly in 
Theorem One way to control this required sample size is to expand the uncertainty set by a 
small margin, say e > 0, i.e., use Ppi[//(A*)] <fil + e and d^{P\Pl,) <r 7 *-|-e, in Q and (10). Note 
that, in the case of moment equality constraint, say Epi[fl{X^)] = /rj, one does have to deliberately 
relax the constraint to /rj — e < Epi [//(A*)] < fil + e for the interior-point conditions to hold. 

A probabilistic analog of Theorem [T] is: 


Theorem 3. Suppose W contains the true distribution P^ for alii with confidence 1 —a, i.e. P{U’' 3 
Pg for all i = 1,... ,m) > 1 — a, then P(Z* < Zg < Z*) > 1 — o;, where P denotes the probability 
generated from a combination of data and prior belief. 
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Theorem follows immediately from Theorem In the frequentist framework, P refers to the 
probability generated from data. However, Theorem can also be cast in a Bayesian framework, 
in which P can represent the prior (e.g., from expert opinion) or the posterior belief. 

Theorem [^reconciles with the established framework in distributionally robust optimization that 
the uncertainty set should be chosen as a confidence set for the true distribution, in order to pro¬ 
vide a guarantee for the coverage probability on the true objective, in the case that P represents the 
generation of data under a true model. For a moment constraint in the form Epi[fl{X^)] < n], one 
can choose as the confidence bound of the moment. Section 3 in Delage and Ye ( 2010| ) discusses 
the construction of confidence regions based on more general semidefinite moment constraints. For 
the distance-based constraint in Q, under the assumption that P* is continuous, ry* chosen as the 
(1 — a)-quantile of sup,j,gjg_^] P(t)/\/n* where B{t) is a standard Brownian bridge, gives an approx¬ 
imate (1 — a) confidence region by utilizing the limiting distribution of the Kolmogorov-Smirnov 


statistic. Bertsimas et al. (2014) studies more general use of goodness-of-fit statistics along this line 
to create confidence regions. For the 0-divergence-based constraint in (|^, under the assumption 
that P® has finite support of size P, Ben-Tal et al. (2013) proposes using ry* = (0"(l)/(2n*))x^i_^ 


in the case is taken as the empirical distribution, where x^i_i i_„ is the (1 — a)-quantile of a 
X^-distribution with degree of freedom r* — 1. This leads to an asymptotic 1 — a confidence region 


by using divergence-based inference (Pardo (2005)). Recent works such as Lam and Zhou (2015), 


Lam (2016a) and Duchi et al. (2016) investigate the tightening of divergence-based regions using 


the empirical likelihood theory (Owen (2001)). 

Similarly, the probabilistic analog of Theorem is: 

Theorem 4. Suppose all assumptions in Theorem^ are in place except that Pp^[//(X*)] < pL\ 
or d0(Pg,Pj) < ?y* now holds true jointly for all i with confidence 1 — a under P. Then P(Z* < 
Zq Op{l/ ^Jn) < Z*) >1 — 0 ;. 

Theorems [^ and [^ allow the translation from (|^, whose input models can be continuously 
represented, to (j^ that is imposed over discrete distributions, by paying a small price of error. In 
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the next section we discuss our algorithm over discrete distributions and point out clearly why the 
discretization is necessary. Obviously, when the input model is finite discrete, the sampling step 
depicted in Theorem is unnecessary, and our subsequent results regarding the algorithm applies 
readily to this case. 

4. Gradient Estimation on Probability Simplices via a Nonparametric 
Likelihood Ratio Method 


Since we work in the discrete space, for simplicity we denote p® = (p® G K”* as the vector 

of probability weights for the discretized input model i. This probability vector is understood to 
apply on the support points {yl ,..., Moreover, let p = vec(p® : z = 1,..., m) G where vec 
denotes a concatenation of the vectors p®’s as a single vector, where N = We denote 

Vi = {{pi ,... ,p;) G : Yl]=iPj = ^ 0)i = 1) • • •) 0 s-s the /-dimensional probability simplex. 

Hence p® G V^i. For convenience, let V = so that p G T*. The performance measure in 

Q can be written as Z['p). Furthermore, denote T = maxi=i_,.._m T® as the maximum length of 
replications among all input models. We also write X = (X^,..., X®®®) and /i(X) = /i(X\ ..., X®®®) 
for simplicity. Recall that I {E) denotes the indicator function for the event E. In the rest of this 
paper,' denotes transpose, and ||x|| denotes the Euclidean norm of a vector x. We also write Farp(-) 
as the variance under the input distribution p. Inequalities for vectors are defined component-wise. 

We shall present an iterative simulation-based scheme for optimizing ([^. The first step is to 
design a method to extract the gradient information of Z(p). Note that the standard gradient of 
Z(p), which we denote as VZ(p), obtained through differentiation of Z{p), may not lead to any 
simulable object. This is because an arbitrary perturbation of p may shoot out from the set of 
probability simplices, and the resulting gradient will be a high-degree polynomial in p that may 
have no probabilistic interpretation and thus is not amenable to simulation-based estimation. 

We address this issue by considering the set of perturbations within the simplices. Our approach 


resembles the Gateaux derivative on a functional of probability distribution (Serfling (2009)) as 
follows. Given any p®, define a mixture distribution (1 — e)p® -|- el®, where 1® represents a point 
mass on y®, i.e. 1® = (0,0,..., 1,..., 0) G Vni and 1 is at the j-th coordinate. The number 0 < e < 1 
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is the mixture parameter. When e = 0, this reduces to the given distribution p\ We treat e as a 
parameter and differentiate , p*“^, (1 — e)p* + el*, p*+^,..., p"*) with respect to e for each 

hJ- 

More precisely, let 


^j(p)=|z(p 


,p*-\(l-e)p* + ei;., p*+\...,p 


*)| 

^ le=0 


Denote '0*(p) = (V’j(p))j=i,...,„i G and '0(p) = vec('j/)*(p) : i = 1,... ,m) G We show that tp 
possesses the following two properties: 


Theorem 5. Given p£V such that p > 0, we have: 

1 . 

m m 

VZ(p)'(q -p) = Y^ V*Z(p)'(q* - pO = - P^) = ^(p)'(q - p) (12) 

i^l i^l 

for any q* G V^i and q = vec{cf = , m), where V*Z(p) G M"* is the gradient of Z taken with 

respect to p*. 

2 . 

V^;.(p) = i?p[/i(x)s}(x*)] (13) 


where s*(-) is defined as 

for X* = (x*i,..., x’j,,) G . 


r^'l 


;(v)=E 


I{x\ = y]) 


rpi 


Pi 


(14) 


The first property above states that ip{p) and VZ(p) are identical when viewed as directional 
derivatives, as long as the direction lies within V. Since the feasible region of optimizations Q lies 
in "P, it suffices to focus on '0(p). The second property above states that V’(p) can be estimated 


unbiasedly in a way similar to the classical likelihood ratio method (Glynn (1990), Reiman and 


Weiss 


(1989|), with s*(-) playing the role of the score function. Since this representation holds 


without assuming any specific parametric form for p, we view it as a nonparametric version of the 


likelihood ratio method. 
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Proof of Theorem\^ To prove 1., consider first a mixture of p* = with an arbitrary 

q® e Vni, in the form (1 — e)p* + eq*. It satisfies 

^Z(p\ ..., p-\ (1 - e)p* T eq^ p*+\ ..., = V*Z(p)'(q* - p^) 

by the chain rule. In particular, we must have 


Mp) = V^ZipYil) - pO = d]Zip) - V^Zipfp^ 


(15) 


where 9*Z(p) denotes partial derivative of Z with respect to p*. Writing (15) for all j together 
gives 

rp^{p) = v^z{p)-iv^zipyp^)r 

where V G M"* is a vector of 1. Therefore 


^*(p)'(q* - P*) = (V*^(p) - (V*Z(p)'pOr)'(q* - pO = V*^(p)'(q* - p*) 


since q%p* €Vni- Summing up over i, (12) follows. 
To prove 2., note that we have 


i;]{p) = -Z{p\ ... ,p-\ (1 - e)p* + el},p*+\ ... ,p”^)L^o 

= i?p[/i(X)s}(X0] 


e=0 


(16) 


where s®(-) is the score function defined as 


d 


^ ^°S((1 - ()p\x\) + = yY)) 




(17) 


e=0 


Here p^xl) =p! where j is chosen such that x] = y). The last equality in (16) follows from the fact 


that 


de 


JJ((1 - e)p\xl) + €l{xl = yp) 




= ^°S((1 - e)pYxl) + el{xl = yp) 

e=0 *=1 




€=0 




Note that can be further written as 
^^pxp + /(x( = yp 




p*(xp 


— Vj) 

^ P"ixi) 




t=i 


Pi 


which leads to (13). 
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The following provides a bound on the variance of the estimator for ipjip) (See Appendix EC.l 
for proof): 

Lemma 1. Assume h(X.) is bounded a.s., i.e. |h(X)| < M for some M > 0, and that p > 0. Each 
sample for estimating by using one sample path of X., possesses a variance bounded from 

above by M^T*(1 —p’'j)lpj. 

The function '0(p) derived via the above Gateaux derivative framework can be interpreted as 


a discrete version of the so-called influence function in robust statistics (Hampel (1974), Hampel 


et al. (2011)), which is commonly used to approximate the first order effect on a given statistics 


due to contamination of data. In general, the gradient represented by the influence function is 
defined as an operator on the domain of the random object distributed under p. Thus, in the 
continuous case, this object has an infinite-dimensional domain, and except in very special cases 


(e.g., tail probabilities of i.i.d. sum; Lam and Mottet (2015)), computing and encoding it is not 
implementable. This is the main reason why we seek for a discretization in the first place. 

5. Frank-Wolfe Stochastic Approximation (FWSA) 

With the implementable form of the gradient 'i/’(p) described in Section]^ we design a stochastic 
nonlinear programming technique to solve ([^. We choose to use the Frank-Wolfe method because, 
for the types oiW we consider in Section]^ effective routines exist for solving the induced linearized 
subproblems. 


5.1. Description of the Algorithm 

/V, _ i ^ 

FWSA works as follows: For convenience denote U = To avoid repetition we focus only 

on the minimization formulation in ([^. First, pretending that VZ(p) can be computed exactly, it 
iteratively updates a solution sequence pi,p 2 ,... as follows. Given a current solution p^, solve 


minVZ(pfc)'(p-pfe) 

p&J 


(18) 


Let the optimal solution to (18) be q^. The quantity — p^ gives a feasible minimization direction 
starting from p^ (note that U is convex). This is then used to update p^ to P/c+i via Pt+i = 
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Pfc + efc(qfc — Pfe) for some step size efc. This expression can be rewritten as p^+i = (1 — efe)pfc 
which can be interpreted as a mixture between the distributions p^ and q^. 

When \/Z{pk) is not exactly known, one can replace it by an empirical counterpart. Theorem]^ 


suggests that we can replace VZ{pk) by i/^(p/c), and so the empirical counterpart of (18) is 


min'0(pfc)'(p-pfe) 

pGU 


(19) 


where V’(pfc) is an estimator of '4){pk) using a sample size R^- Note that all components of '^(pfc) 
can be obtained from these Rk sample paths simultaneously. Letting q/j be the optimal solution to 


(19), the update rule will be p^+i = (1 — efe)Pfc + ^khk for some step size e^. The sample size Rk at 


each step needs to grow suitably to compensate for the bias introduced in solving (19). All these 
are summarized in Procedure [TJ 


Algorithm 1 FWSA for solving (1) 


Initialization: pi G P where pi > 0. 

Input: Step size sequence e^, sample size sequence Rk, k = l,2, 
Procedure: For each iteration k = 1,2 ,..given p^: 

1. Repeat Rk times: Compute 


h(X)s*(X®) for alH = 1,...,m 

using one sample path X = (X^,...,X™), where s*(X*) = fo^ 3 — 

1.. .. ,n* and i = 1,... ,m. Call these Rk i.i.d. replications Q(^)) for j = 1,...,n*, i = 1,... ,m, r = 

1.. .., Rk . 

2. Estimate 'ip{pk) by 

\ " ’■=1 / j=l,...,n'‘ 

3. Solve qfc G argminpg^i/:(pfc)'(p - p^). 

4. Update pfc+i = (1-efe)pfc+ efeqfe. 
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5.2. Solving the Subproblem 


By (12) and the separability of uncertainty set U = the subproblem at each iteration can 

be written as 

m m 

minV?/>*(p)'(q*-p*) = V min ^*(p)'(q* - p‘) (20) 

where ■«/»*(p) = ('0)(p))j=i,..,,ni is the empirical counterpart of '0*(p) obtained in Algorithmj^ Hence 
( p0| ) can be solved by m separate convex programs. The update step follows by taking p^+i = 
vec(p^_,_;^ : i = 1,..., m), where p^_,_;^ = (1 — ^k)Vk + and q^ is the solution to the z-th separate 
program. 


The separate programs in (20) can be efficiently solved for the uncertainty sets considered in 


Section]^ To facilitate discussion, we denote a generic form of each separate program in (20) as 


min f'p* 


( 21 ) 


for an arbitrary vector ^ = (■Cj)j=i,...,„i £ ■ 


Case 1 in Theorem Moment and support constraints. Consider W = {p* e : f;'p* < p], I = 


1,..., s*} where f/ = (/i(y)))i=i,....n* £ • Then (21) is a linear program. 


Case 2 in Theorem^ 4>-divergence neighborhood. Consider 

W = ( 22 ) 

where p;, = (Pbj)j=i„...„i G 7^”' and dAv\ Pb) = YTj=i Pb,j^{P)/Pb,j)- The distribution p*^ is the empir¬ 
ical distribution from so p\ ^ = l/n\ However, we shall discuss the solution scheme slightly 
more generally to cover the case where the original input distribution is discrete and non-uniform. 


Proposition 1. Consider (21) with W presented in (22), where p^ > 0. Let (^*{t) = sup^yQ{tx — 
(j){x)} be the conjugate function of (j), and define 00* (s/0) = 0 if s <0 and 00* (s/0) = -|-oo if s >0. 
Solve the program 


(a*. A*) G argmax^ 


'a>0.AGR 


—a 




i=i 


a 


— arj — X 


(23) 


An optimal solution q* = ((?})j=i,..,,„i for (21) is 
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1. If a* > 0, then 


2. If a* = 0, then 


Qi =Pbr maXr>o 


Pb,- 


Cj + 


a’ 


r — (f{r) 


(24) 


qi _ J HjeM'^Pb,j 


for j G 


(25) 


0 otherwise 

where A4^ = argmin^f^j, the set of indices j G {1,... ,n*} that have the minimum 


Operation (23) involves a two-dimensional convex optimization. Note that both the function 


0* and the solution to the re® one-dimensional maximization ( |24[ ) have closed-form expressions for 
all co mm on (()-divergence (Pardo (2005)). The proof of Propositionfollows closely from Ben-Tal 


et al. (2013) and is left to Appendix EC.l 


In the special case where cj) = xlogx — x -f 1, i.e. KL divergence, the solution scheme can be 
simplihed to a one-dimensional root-finding problem. More precisely, we have 


Proposition 2. Consider (21) withW presented in (22), where 4>{x) = xlogx — x + 1 andpl>0. 


Denote A4® = argmin^f^j as in Proposition^ An optimal solution q® = (g®)j=i,...,r!,i for (21) is: 
1. If - log Pl,j ^ 


Pb,' 


qi — } Pb,j 


for j G Al® 


otherwise 


(26) 


2. //-logEjg^.Kj >r 7 ®, then 


for all j, where /3 < 0 satisfies 


9 = 


Pb,je^ 






(27) 


(28) 


Here (p\{/3) = logX)jPb logarithmic moment generating function of ^ under p^. 

The proof of Proposition follows from techniques such as in Hansen and Sargent (2008), and is 
left to Appendix |EC.1[ 
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6. Theoretical Guarantees of FWSA 

This section shows the convergence properties of our proposed FWSA. We first present results on 
almost sure convergence, followed by a local convergence rate analysis. Throughout our analysis 
we assume that the subproblem at any iteration can be solved using deterministic optimization 
routine to a negligible error. 


6.1. Almost Sure Convergence 


An important object that we will use in our analysis is the so-called Frank-Wolfe (FW) gap (Frank 


and Wolfe 


(1956)): For any p GU, let g{p) = — miUp^^'0(p)'(p — p), which is the negation of the 


optimal value of the next subproblem when the current solution is p. Note that g{p) is non-negative 
for any p GU, since one can always take p = p in the definition of ^(p) to get a lower bound 0. In 
the case of convex objective function, it is well-known that g{p) provides an upper bound of the 


actual optimality gap (Frank and Wolfe (1956)). However, we shall make no convexity assumption 
in our subsequent analysis, and will see that g{p) still plays an important role in bounding the 
local convergence rate of our procedure under the conditions we impose. 

Our choices on the step size and sample size per iteration R]^ of the procedure are as follows: 

Assumption 1. We choose e^,, A: = 1, 2,... that satisfy 

OO OO 

Ck = OO and ^ e\ < 


OO 


k=l 


k=l 


Assumption 2. The sample sizes i?*,. A; = 1, 2 ,... are chosen such that 

OO k— 1 

k=i V Afc 

where for convenience we denote 

Note that among all in the form c/k°‘ for c > 0 and a > 0, only a = l satisfies both Assumptions 
and and avoids a super-polynomial growth in simultaneously (recall that Rk represents 
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the simulation effort expended in iteration k, which can be expensive). To see this, observe that 
Assumptionasserts a G (1/2,1], Now, if a < 1, then it is easy to see that grows 

faster than any polynomials, so that Rk cannot be polynomial if Assumption needs to hold. On 
the other hand, when a = 1, then grows at rate y/k and it is legitimate to choose 

Rk growing at rate k^ with /3 > 1. 

We also note that the expression ej) in Assumption is imposed to compensate 

for a potentially increasing estimation variance, due to the form of the gradient estimator depicted 


in (13) and (14) that possesses p® in the denominator and thus the possibility of having a larger 


variance as the iteration progresses. 

We state our result on almost sure convergence in two parts. The first part only assumes the 
continuity of g{-). The second part assumes a stronger uniqueness condition on the optimal solution, 
stated as: 

Assumption 3. There exists a unique minimizerp* /or minpg^ Z(p). Moreover, g{-) is continuous 
over IT and p* is the only feasible solution such that g{p*) = 0. 


In light of Assumption!^ g plays a similar role as the gradient in unconstrained problems. The 
condition g{p*) = 0 in Assumptionis a simple implication of the optimality of p* (since g{p*) > 0 
would imply the existence of a better solution). 

Our convergence result is: 

Theorem 6. Suppose that h{X.) is bounded a.s. and that Assumptions [7]{^ hold. We have the 
following properties on p^ generated in Algorithm\^ : 

1. Assume that g{-) is continuous and an optimal solution exists. Then D{Z{pk), Z*) ^ 0 a.s., 
where Z* = {Z(p) : p satisfies g{p) = 0} and D{x, A) = inf^g^ ||a; — y\\ for any point x and set A in 
the Euclidean space. 

2. Under Assumption\^ Pit converge to p* a.s.. 

Part [T] of Theorem 1^ states that the objective value generated by Algorithm [T] will eventually get 
close to an objective value evaluated at a point where the FW gap is zero. Part strengthens the 
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convergence to the unique optimal solution p* under Assumption In practice, this uniqueness 
condition may not hold, and we propose combining Algorithm with multi-start of the initial 


solution Pi as a remedy. Section 7.1 shows some numerical results on this strategy. 


6.2. Local Convergence Rate 

We impose several additional assumptions. The first is a Lipchitz continuity condition on the 


optimal solution for the generic subproblem (21), with respect to the coefficients in the objective 


in a neighborhood of the gradient evaluated at p*. Denote v(^) as an optimal solution of (21). We 


assume: 


Assumption 4. We have 

l|v(6)-v(6)ll <^116-611 

for some L>0, for any ^ 1,^2 £ A/a('0(p*))> where A/a('0(p*)) denotes a Euclidean neighborhood 
of'ip{p*) with radius A, and p* is assumed to he the unique optimal solution /or miUp^^ Z(p). 

Next, we denote q(p) as an optimizer in the definition of the FW gap at p, i.e. q(p) G 
argminp-0(p)'(p — p). We assume: 

Assumption 5. 

5(p)>c||'iA(p)||||q(p)-p|| 
for any p GU, where c> 0 is a small constant. 


Assumption 6. 


for any pGlA, for some constant r. 


||^(p)|| >r>0 


Assumption guarantees that the angle between the descent direction and the gradient must be 
bounded away from 90° uniformly at any point p. This assumption has been used in the design 
and analysis of gradient descent methods for nonlinear programs that are singular (i.e. without 


assuming the existence of the Hessian matrix; Bertsekas (1999), Proposition 1.3.3). 









Ghosh and Lam: Robust Analysis in Stochastic Simulation 


25 


The non-zero gradient condition in Assumption effectively suggests that the local optimum 
must occur at the relative boundary of U (i.e. the boundary with respect to the lower-dimensional 
subspace induced by the probability simplex constraint), which warrants further explanation. Note 
that the other alternate scenario for local optimality will be that it occurs in the interior region 
of the feasible set U. In the latter scenario, the gradient at the optimal solution is zero and the 
neighborhood of the optimal solution is either convex or concave (depending on min or max). While 
the convergence analysis can be simplified (and plausibly give a better rate) under this scenario, the 
statistical implication brought by this scenario is rather pathological. Note that our optimizations 
are imposed on decision variables that are input probability distributions. As discussed at the 
end of Section]^ the gradient vector '0(p) is the influence function for the performance measure 
Z{-). If the influence function is zero, it is known that a Gaussian limit does not hold in the 
central limit theorem as the input sample size gets large (where the central limit theorem is on 
the difference between a simulation driven by empirical distributions and the truth). Instead, a 


X^-limit occurs (Serfling (2009), Section 6.4.1, Theorem B). Such type of limit is pathological and 


has never been reported in simulation analysis. Indeed, in all our experiments, the obtained local 
optimal solution is always at the boundary. For this reason we impose Assumption rather than 
a more straightforward zero-gradient type condition. 

The following are our main results on convergence rate, first on the FW gap ^(pfe), and then 
the optimality gap Z(pi~) — Z{p*), in terms of the number of iterations k. Similar to almost sure 
convergence, we assume here that the deterministic routine for solving the subproblems can be 
carried out with high precision. 


Theorem 7. Suppose |/i(X)| < M for some M > 0 and that Assumptions [7]|^ hold. Additionally, 
set 


a 


and Rk = bk^ 
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when k > a, and arbitrary efc < 1 when k < a. Given any 0 < e < 1, it holds that, with probability 
1 — e, there exists a large enough positive integer ko and small enough positive constants v,!}, g such 
that 0 < gi'Pko) < and for k > ko, 


where 


9{Pk) < 


A 


+ Bx { 


1 

{C-'y)k-l 


(7-C)(fco-l)T'-CfcC 

log((fc-l)/(fco-l)) 


i/ 0 < 7 < C* 
ifl>C 
tfl = C 


A = 9{Pko)ko, 


B = 



f 2a^gK 

ag^ -^ 

\ crko 



(29) 


and 


C = a 1- 


2X1^ 2Kv\ 


CT 


-) 


(30) 


Here the constants L,c,t appear in Assumptions\^\^ and\^ respectively. The sample size power j3 
needs to be chosen such that /3 > 27 + 0 + 1. More precisely, the constants a,b,l3 that appear in the 
specification of the algorithm, the other constants ko,'d, g,^, K, and two new constants p>l and 


5 > 0 are chosen to satisfy Conditions 1-8 listed in Appendix EC.l 


Corollary 1. Suppose that all the assumptions are satisfied and all the constants are chosen as 
indicated in Theorem^ Then with probability 1 — e, there exists a large enough positive integer ko 
and small enough positive constants g such that 0 < g{Pko) < o-nd for k > ko. 


Z{pk)-Z{p*)<j^ 


E 


+ Fx { 


_1_ 

(C-7)7(fc-l)7 

_ 1 _ 

(-i-C){,ko-l)l-Cc(k-l)C 

log((fc-l)/(/co-l)) 

C{k-l)C 


i/ 0 < 7 < (7 
ifl>C 
ifl = C 


where 


D = 


a^K 



F = aB 


( 31 ) 


and a,A,B,C,K are the same constants as in Theorem^ 
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A quick summary extracted from Theorem and Corollary is the following: Consider the local 
convergence rate denominated by workload, i.e. the number of simulation replications. To achieve 
the most efficient rate, approximately speaking, a should be chosen to be 1 + a; and /3 chosen to 
be 5 + C + w for some small cj, C > 0. The local convergence rate is then where W 

is the total number of simulation replications. 

Note that the bounds in Theorem and Corollary are local asymptotic statements since they 
only hold starting from k>ko and ^(pfc) < i' for some large kg and small v. It should be cautioned 
that they do not say anything about the behavior of the algorithm before reaching the small 
neighborhood of p* as characterized by 0 < ^(Pfeo) < i'- The above summary therefore should be 
interpreted in the way that, given the algorithm has already run ko number of replications and 
9{Pk) < for a suitably small u (which occurs with probability 1 by Theorem]^, the convergence 
rate of for the optimality gap is guaranteed with probability 1 — e starting from 

that point. 

The summary above is derived based on the following observations: 

1. The local convergence rate of the optimality gap, in terms of the number of iterations k, is at 


best This is seen by (31). 

2. We now consider the convergence rate in terms of simulation replications. Note that at itera¬ 
tion k, the cumulative number of replications is of order ~ k^^^. Thus from Point 1 above, 

the convergence rate of the optimality gap in terms of replications is of order 

3. The constants C and 7 respectively depend on a, the constant factor in the step size, and /?, 
the geometric growth rate of the sample size, as follows: 


(a) (30) defines C = a(l — 2KL'd/{cT) — 2Kv/{c^A)). For convenience, we let w = 
2KL'd/ (cr) -|- 2Kvj(c^A), and so C = a(l — u). 

(b) From Condition 5 in Theorem]^ (shown in Appendix EC.l), we have (3 = 2j + pa + 2 + (^ 
for some C > 0. In other words 7 = (/3 — pa — C — 2)/2. 
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4 . Therefore, the convergence rate in terms of replications is p“ C 2 )/ 2 )ai)/(^+i) 

Let us focus on maximizing 

(a(l - w)) A ((/? - pa - C - 2)/2) A 1 


/3 +1 


over a and /3, whose solution is given by the following lemma: 


Lemma 2. The maximizer of (32) is given by 


a—--, f3 — -l-C + 4 

1 — uj 1 — CJ 


and the optimal value is 


p/(l — a;) + C + 5 


(32) 


The proof is in Appendix |EC.l With Lemmaj^ let us choose r? and n, and hence to, to be small. We 
also choose p to be close to 1. (Unfortunately, these choices can lead to a small size of neighborhood 
around p* in which the convergence rate holds.) This gives rise to the approximate choice that 
a Ri 1 + a; and ,0 5 + C + w. The convergence rate is then 

We compare our results to some recent work in stochastic FW. [Kazan and Luo (2016) showed that 
to achieve e error in terms of the optimality gap one needs 0(l/e^'®) number of calls to the gradient 


estimation oracle, when the objective function is strongly convex. Reddi et al. (2016) showed that 


the number needed increases to 0(l/e^) for non-convex objectives, and suggested several more 
sophisticated algorithms to improve the rate. Corollary and our discussion above suggests that 
we need 0(1/6®+^“'"'^) sample size, for some small > 0, a rate that is inferior to the one achieved 


m 


Reddi et al. (2016). However, Reddi et al. (2016) has assumed that the gradient estimator is 


uniformly bounded over the feasible space, a condition known as G-Lipschitz (Theorem 2 in Reddi 


et al. (2016)), which does not hold in our case due to the presence of p! in the denominator in (14) 


that gives a potentially increasing estimation variance as the iteration progresses. This complication 
motivates our sample size and step size sequences depicted in Assumption and the subsequent 
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analysis. On the other hand, if Assnmption is relaxed to hold for any ^ 1,^2 £ , it can be seen 

that by choosing /3 ~ 3 + C + w onr complexity improves to which almost matches the 


one m 


Reddi et al. (2016) (see Remark EC.l in Appendix EC.l). However, such a relaxed condition 


would not hold if the constraints are linear, because the optimal solutions of the subproblems are 
located at the corner points and will jump from one to the other under perturbation of the objective 
function. 

7. Numerical Experiments 

This section describes two sets of numerical experiments. The first set (Section |7.1[ ) studies the 
performance guarantees from Section regarding our randomized discretization strategy and the 
tightness of the bounds coming from moment constraints. The second set of experiments (Section 


7.2) studies the numerical convergence of FWSA and visualizes the worst-case input distributions. 


Unless specified, in all experiments we terminate the FWSA algorithm at iteration k if at least one 
of the following criteria is met: 

• The cumulative simulation replications Wk reaches 5 x 10®, or 

• The relative difference between objective value Z{pk) and the average of the observed values 

in 30 previous iterations, (X]«li is below 5 x 10“®, or 

• The gradient estimate i/^(p/c) has an ? 2 -iiorm smaller than 1 x 10“®. 


7.1. Performance Bounds for Multiple Continuous and Unbounded Input Models 

We use the example of a multi-class MjGjl queue where jobs from three distinct classes arrive 
and are attended to by one server. Snch a stochastic model captures many real-world systems, e.g. 
the triage of incoming patients in an emergency room, processing calls placed to a customer service 
center, insurance claims routing etc. Let P = represent all the uncertain probability 

measures, where each P® = i = 1,2,3 with j = 1 for inter-arrival and j = 2 for service, 

represents the joint measure of the inter-arrival and service distributions of jobs of the classes. The 
uncertainty in specifying the service and inter-arrival distributions of each class could be due to the 









30 


Ghosh and Lam: Robust Analysis in Stochastic Simulation 


novelty of the service operation with little pre-existing data to infer the distributions adequately. 
The performance measure of interest is the weighed average waiting time: 


Z{P) = Ep 


3 / 


2=1 


(33) 


where the average is observed up to a (fixed) T* = 500 customers of class i and Ci is the cost assigned 
to its waiting times. The server needs to choose a job-selection policy for the next waiting job to 
service. Jobs within each class are served on a hrst-come-hrst-served (FCFS) basis, so the policy 
needs specify only the (non-empty) class of the next served job. This uses a hxed priority ordering 
based on the celebrated c// rule (Kleinrock ( 1976[ )), which prioritizes the classes in decreasing 
order of the product of Ci and the mean service rate of class i] this rule minimizes Z{P) in the 
multi-class MjGjX system of interest. 

We model the uncertainty set via moment constraints on the P* as: 


U = \^U\ whereZY* = {PP/i)’^'<Ppi[(W’^)'] < / = 1,2, j = l,2} (34) 

i 

where the index / = 1,2 represents the first two moments of marginals P'^’K The moment bounds 
and can be specified from prior or expert opinion. Here, to test the information value with 
respect to the accuracy of the moments, we specify the bounds from a conhdence interval on the 
corresponding moments calculated from the M observed data points (M®’-^ =M for all i,j)- For 
example, 

where ta/ 2 ,M-i is the (1 — a/2)-quantile of the Student-t distribution with degree of freedom M — 1, 
is the empirical th moment and is the associated sample standard deviation as observed 
from the M data points. Suppose that the true marginal distribution of interarrival times for 
each class is exponential with rate 0.5 and the true service distribution of the three classes are 
exponentials with rates 2.25,2.0 and 1.75 respectively, to yield an overall traffic intensity of 0.75. 

The FWSA algorithm is run by hrst sampling a discrete approximate support from bivariate 
independent-marginal lognormal distributions as representative of each P* with support size n = 
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50,100,250 (we assume the support size corresponding to each distribution P* is all equal to n). 
Selecting lognormal distributions is a reasonable choice if the modeler conjectures that the true 
distributions are light-tailed. Here we set the means and standard deviations of the lognormals to 
1. The parameter n should ideally be large to minimize discretization error (see Theorem [^, but 
this pays a penalty in the slowness of the FWSA algorithm. 


J \ 

I I 


■ 50 Supp. Pts 

• 100 Supp. Pts 

• 250 Supp. Pts 
SS Average 


100 

Num samples defining moment constraints 



0 . 1 - 


10 



50 Supp. Pts 
■ 100 Supp. Pts 

• 250 Supp. Pts 

SS Average 

100 1000 
Num samples defining moment constraints 


(a) lognormal for discretization 


(b) exponential for discretization 


Figure 1 The range from max to min worst-case objectives when M and n vary as indicated. The dotted-line 
indicates the expected steady-state performance under the true distribution. 


Figure la shows the output of our approach over various n and M to illustrate the effect of 
discretization and the tightness of our bounds with respect to the accuracy of moment information. 
The true steady-state performance measure of the multiclass M/M/1 system, available in explicit 


form (Kleinrock (1976)), is indicated in each plot. The bounds provided by our method under 
each discretization scale are all seen to cover the true performance value. This is predicted by 
Theorem]^ as the moment constraints are all correctly calibrated (i.e. contain the true moments) 
in this example. As support size increases from 50 to 250, we see that the bounds get tighter 
significantly, as a manifestation of the error Op{l/y/n) in Theorem]^ On the other hand, as the 
moment constraints become tighter (shrinking in the order Op(l/\/M)), we also see a significant 
tightening of the bounds. When M is above 100, it appears that the increase of support size above 
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50 has little effect on the accuracy of the bounds, as shown by the non-monotonicity in the length 
of the range as n increases from 50 to 250. This suggests the use of a moderate support size if 


the moment constraints are reasonably tight. Lastly, Figure lb plots the performance when the 
supports of the distributions are sampled from the true distributions, i.e. when the observed data 
is used in the procedure, and shows similar conclusions. 



Figure 2 Box plot of returned optimal solutions from 10 runs on n = 175, M = 100, lognormal discretization 


The above results are implemented with an initialization that assigns equal probabilities to the 
support points. Next, Figure provides a box-plot of the identified optima for ten sample paths 
of the FWSA algorithm where the initial probability masses for the support points (held constant 
for all runs) are sampled uniformly independently with appropriate normalization. The sample 
size for moment constraint generation is M = 100 and the discretization support size is n = 175. 
The returned optimal solutions for each of the minimization and maximization formulations are 
observed to cluster closely, indicating that the formulations have a unique global optimal solution 
for the support set. 

7.2. Convergence of FWSA and Worst-case Input Distributions 

We test the numerical convergence of FWSA. The key parameters in the algorithm are the sample- 
size growth rate (3 and the step-size constant a. Varying these two parameters, we empirically test 
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the rate of convergence of the FW gap to zero analyzed in Theorem and the objective function 
Z(pfc) to the true optimal value Z{p*) analyzed in Corollary We also investigate the magnitude 
of the optimal objective value and the form of the identified optimal solution. 

Consider an MjGjl queue where the arrival process is Poisson known with high accuracy to 
have rate A = 1. On the other hand, the service time Xt for the t-th. customer is uncertain but 
assumed i.i.d.. A simulation model is being used to estimate the expected long-run average of the 
waiting times Z{p) =£’p[/i(X)], where 

h{X) = ^^W, 

1 

and Wt is the waiting time obtained from Findley’s recursion, and At ~ Exp{\) is the interarrival 
time of the {t — l)-th customer. 

We test our FWSA with a KL-divergence-based uncertainty set for Xt as 


M=|p:gp,log(^)<, 

where pt, = (pbj)j=i,..,,n is a baseline model chosen to be a discretized mixture of beta distribu¬ 
tion given by 0.3 x Beta(2,6) -1-0.7 x Beta(6,2). The discrete supports are obtained by uniformly 



discretizing the interval [0,1] into n points, i.e. yj = {j + l)/n. 


The set (35) provides a good testing ground because steady-state analysis allows obtaining an 


approximate optimal solution directly which serves as a benchmark for verifying the convergence 


of our FWSA algorithm. As T grows, the average waiting time converges to the corresponding 
steady-state value, which, when the traffic intensity Pp = Ep[Xt] is less than 1, is given in closed 


form by the Pollaczek-Khinchine formula (Khintchine (1932)) as: 


^oo(p) 


PpEp[X,] + Varp{X,) 
2(1-Pp) 


So, when T is large, an approximation to the worst-case performance estimate can be obtained 
by replacing Z{'p) with Zoo(p). (In experiments, a choice of T = 500 seems to show close agreement.) 


With £'p[Ai] = '^PjVj and Ep[X^] = '^PjPj, the steady-state approximation to (35) is given by 
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(SS), which is equivalent to (SS') via variable substitutions (see p.l91 in [Boyd and Vandenberghej 


(2009) for further details): 




mm , ^ 

P 2(l-X;,Pj%) 


(SS) 


mm 

p 


E 


WjVj 


(SS') 


s.t. 


) <V 


Pi 

\Pbj 


s.t. ^Wjlog 


Wi 


< Tjt 


3 

0<Pj<l, Vj = l,. 


,n 


tPbj 

J 

2t-2'^Wjyj = l 
3 

3 

0 <Wj <t yj = i,...,n 


Figure [^captures the performance of our FWSA algorithm as a function of the a and /3 parame¬ 
ters. Figures [3a||^ plot the (approximate) optimality gap as a function of the cumulative simulation 


replications Wk for the maximization problem under (35). We set the parameters rj = 0.025, n = 100 


and T = 500. Figures [3^ [3l^ and [3^ provide further insights into the actual observed finite-sample 
performance (When interpreting these graphs, note that they are plotted in log-log scale and thus, 
roughly speaking, the slope of the curve represents the power of the cumulative samples whereas 
the intercept represents the multiplicative constant in the rate): 


Fig. 3a v.s. 3b- 3c: Convergence is much slower when a < 1 no matter the value of /?. 


Fig. 3b: For a > 1, convergence is again slow if /3 > 4. 


• Fig. 3b: For a slightly greater than 1, the convergence rates are similar for (5 G [2.75,3.25] with 
better performance for the lower end. 

• Fig. [^ ' For /3 = 3.1, the rate of convergence generally improves as a increases in the range 
[1.10,2.75]. 


Figs. 3a. 3b and 3c: The approximation from (SS) of the true Z{y>*) has an error of about 


0.006 for the chosen T, as observed by the leveling off of all plots around this value as the sampling 
effort grows. 

Figureshows the FW gap as a function of the iteration count. In general, the sample paths with 
similar /3 are clustered together, indicating that more effort expended in estimating the gradient at 
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(a) small a, /3 varied as shown (b) a = 1, /3 varied as shown 




(c) /3 = 3.1, a varied 


(d) Frank-Wolfe gap vs iteration count 


Figure 3 


Figs 


3a 


3b 


andl^plot the optimality gap of the FWSA algorithm for the MjGI jl example as function 


of cumulative simulation samples (both in log-scale), under various combinations of step-size parameter 


a and sample-size growth parameter /?. The three figures have the same range of values in both axes. 


Fig |3d| shows the FW gap as a function of iteration count (both in log-scale). All figures provide the 


legend as a, 13. 


each iterate leads to a faster drop in the FW gap per iteration. Within each cluster, performance 
is inferior when a < 1, consistent with Theorem Since most runs terminate when the criterion on 
the maximum allowed budget of simulation replications is expended, the end points of the curves 
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indicate that a combination of a > 1 and a /? of around 3 gains the best finite-sample performance 
in terms of the FW gap. These choices seem to reconcile with the discussion at the end of Section 
6.2 when Assumptionis relaxed to hold for any ^ 1,^2 £ 1^^- 



(a) (min) pt, from beta-mixture (b) (max) p;, from beta-mixture 

Figure 4 Optimal solutions p* identified by the FWSA algorithm with n = 100 and rj = 0.05, setting a = 1.5,/3 = 
2.75. The gray bars represent the baseline p.m.f. p;,. 


Finally, Figurej^shows the form of the optimal distributions p* identified by the FWSA algorithm 
for the minimization (Figure [4a| and maximization (Figure [db] ) problems under (35). The optimal 
distributions follow a similar bimodal structure as the baseline distribution pf,. The maximization 
version assigns probability masses in an unequal manner to the two modes in order to drive up 
both the mean and the variance of p, as (SS) leads us to expect, whereas the minimization version 
on the other hand makes the mass allocation more equal in order to minimize the mean and the 
variance of p while maintaining the maximum allowed KL divergence. 

8. Conclusion 


In this paper we investigated a methodology based on worst-case analysis to quantify input errors 
in stochastic simulation, by using optimization constraints to represent the partial nonparametric 
information on the model. The procedure involved FWSA run on a randomized support set and 
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a gradient estimation technique based on a nonparametric version of the likelihood ratio or the 
score function method. We studied convergence properties of the proposed FWSA. We also tested 
it and verified the theoretical implications on several queueing examples. 

We suggest several lines of future research. First is the extension of the methodology to dependent 
models, such as Markovian inputs or more general time series inputs, which would involve new sets 
of constraints in the optimizations. Second is the design and analysis of other potential alternate 
numerical procedures and comparisons with the proposed method. Third is the utilization of the 
proposed worst-case optimizations in various classes of decision-making problems. 
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Proof of Theorem\^ Let S{y) be the delta measure at y. For each z = 1,... ,m, define 

h m) 

i.e., the distribution with point mass L*(?/®)/L(y®) on each y®, where U = dP^/dQ^ or = 
dP^/dPf; depending on Case 1 or 2. We first show that as n —^ oo, the solution is feasible 

for the optimization problems in Q eventually. 

Consider Case 1. Since ||L*||oo < oo, we have VarQi{L{X'')) < oo. Moreover, for each l = 
EQ^{fl{X^)L{X'')f = Epi[fi{X^YL{X^)] < CEp^^[fl{XY‘^] < oo for some C > 0 by using ||L*||oo < 
oo again and the assumption Epi[fl{XY'^\ < oo. Therefore, by law of large numbers. 

Since Egt [//(X*)L(X®)] = Epi [//(X*)] < y® by our assumption, we have Epi [//(X*)] < y] eventually 
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by law of large numbers, since Epi[(j){Uy] < oo by assumption. For the second term in (EC.l), 


note that 




t=l 


1 " 

^ E CEp.\U{X^)\ < oo 




for some C > 0, where the first inequality follows from the assumption that < oo for all x G M'*' 
and is uniformly bounded a.s. since ||.b*||oo < oo, and the convergence follows from law of large 
numbers by using ||L*|| < oo and hence £'pi|L*(X*)p < oo. Moreover, 

b 

(!/»■) Ei.i'fe) 

by law of large numbers again. Hence the second term in (EC.l) converges to 0 a.s.. Overall, ( |EC.l ) 
converges to d,f>{PQ,Pl) a.s.. Since d^{PQ,Pl) < ry® by our assumption, we have d^{P'',P^) < rj'^ 
eventually as re® —>■ oo. 

Next we consider the objective in (I^. We show that Z{P ^,..., P®®®) satisfies 


Z{P\...,P"^) = Z{P\...,P"') + J2 / ^\x-,P\---,P"")d{P^-P^){x) + 0 

i=l \i=l 

for any probability distributions .P®,P®, where 

^\x;P\..., Pn = i] [hiX\ ..., X®'®)|X,® = x] 

t=i 

and F® and P® are the distribution functions associated with P® and P®. 

Consider 


'*112 


(EC.2) 


(EC.3) 


m T* 


Z(P\...,P”®) = 


h{^\ ■ ■ ■, x'”) n n 


2=1 

m T" 


:Z(P\...,P®®®) + EE/ '/ /i(x\...,x”®)d(P®-P®)(x®) ll dP^(xi) + P 

i=l t=l 

(EC.4) 


where R is the remainder term written as a finite sum of terms in the form 
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with I > 2. (EC.4) is obtained by expanding Hili IlLi d[{P’' — P''){x\) + P'^{x\)] and grouping the 


terms with only one factor of d(P* — P*)(xJ). The second term in (EC.4) can be simplified to 


E 




On the other hand, any term in the form (EC.5) satisfies 


/ ••• / n dp^{xi) 


< 




k=l 


where I >2. Thus we conclude (EC.2). 


Therefore, putting P® = P® and P® = Pq® in (EC.4), we have 

m p / m \ 

ZiP\ ..., P®-) = Z(Po\ ..., P-) + Y, P^,---, P^)d{P^ - p:,){x) + O E - i^oIlL 

i=l \i=l J 

(EC.6) 

where P® and Pg® are the distribution functions associated with P® and Pg®. The rest of this proof 
focuses on i that is in Case 1, but Q® can be replaced simply by P^® if i is in Case 2. Consider the 


second term in (EC. 6 ). For each z, we have 


VrP J (/ 9 ®( 


x;Pj,...,Pg™KP®-Pg®)(x) = Vn 


(1/n®)^” i(^®(ypL®(y®) 


where for convenience we write v^g(x) = ¥?®(x; Pg^,..., P™). Since 


-Pq,K(X®)P®(X®)] 

(EC.7) 
< oo by 


oo < oo, we have 


the definition in (EC.3). Together with the assumption ||L®|| < oo, we have VarQi{(p'l{X'‘)V{X^)) < 


oo and VarQi{P{X’‘)) < oo. Thus, by the delta method, (EC.7) converges in distribution to 
A^(0, cr^), where 



1 

/ 

EarQ.(<^®(X®)P(X®)) 

Covq.{^1{X^)U{X'),P{X')) 


-Pg,[(^®(X®)P(X®)] 


Covq4^‘o(X%^(X^),P(X^)) 

VarQ.{P{X^)) 



1 

-Pg.[(^®(X®)P(X®)] 


Hence f <pg(x)d(P" — Pg®) = Op hence 


E 


(^®(x)d(P®-Pg®)(x)=Op{^ 


(EC. 8 ) 
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by the assumption that C < n'' jn < C. 


Now consider the remainder term in (EC.6). We have 
(F* - F^){x) = - F^,{x) 

(1/n*) Yjf=i L"{y})I{y} <x)- EQi[L^{X^)I{X^ < x)] Fo*(x)((l/n*) L^y}) - 1) 




iiMEUmy]) 

(EC.9) 


where the inequality in /(y® < x) is dehned such that each component of y® and x satisfy the 


inequality. Since ||L®|| < oo, {L®(•)/(•< x) : x G Al®} is Donsker (Van Der Vaart and Wellner (1996), 
Example 2.10.8). Therefore, 

^ < a:) - Fq.[L®(V®)/(V® < x)] j ^ G in ^“(V®) 

where = {/ : V® —> M|||/(-)||oo < oo} and G is a Gaussian process on V® with covariance 

C'ou(G(xi),G(x 2 )) = CovQi{U{X^)I{X^ < Xi),L®(V®)/(V® < X 2 )). Moreover, by continuous map¬ 
ping theorem, 


E ^ - ®^Q® [L\xy{x^ < x)] 


i=i 


which also gives 


1 ^ <x)- Eq. [L®(V®)/(X® < x)] 


i=i 


Eurthermore, since ||T®||oo < 00 , we have VarQi[U{X'^)) < 00 . By law of large numbers, we have 
(l/n®)X)"=i'^*(yp“^ 1 Thus, 

(1/n*) Eii L\v^^I[y] < x) - Fq.[L®(V®)/(V® < x)] 


(i/r^0E;=iT®(y;) 

by Slutsky’s theorem. 

On the other hand, since VarQi{U‘{X'‘)) < 00 , we have 


(EC.10) 


n® I — 
n 


- J]L®(y®)-l I ^V(0,VarQ.(L®(X®))) 
i=i 
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by central limit theorem. Together with (1/n*) ^ have 


by Slutsky’s theorem again. So 


LKy})y 






Therefore, from (EC.9), {F^ — Fq){x) satisfies 




= vn 


< 4 


(EC.11) 


(1/n*) J2]=i F{y^I{y] <x)- EQi[LpX^)I{X^ < x)] Fj(x)((l/n 


/ TL^ 

{'^/n^)T7=iF{tj)I{y]<x)-EQi[LpX^)I{X^<x)] 

(1/n* 

2 

+ \/n* 

OO 

F*(x)((l/n^)E;li 





j)-l) 


by (EC.10) and (EC.11). Hence the remainder term in (EC.6) is Op{l/y/n) by the assumption that 


Q<n^jn<C. Combining with (EC.8), (EC.6) gives 


ZiP\...,Pn = ZiP^,...,Pn + Op{^)+Op(^)=Zo + Op(^ 


Since we have shown that is feasible for ([^ eventually, we have 

F Zq + Op ( —— ) < Z 


This concludes the theorem. 

Proof of Lemma [i] We have 

Earp(/i(X)s}(X^)) < E^{h{X)s){X.^)r < M^E^{s]{X.^)y = MpV ar^{s)0O)) + {Ep,[s]{yP)]f) 

(EC.12) 


Now note that by the definition of s}(X) in (14) we have Ep[s® (X®)] = 0 and 

rWarp(/(X* = y))) _r*(l-p}) 


Earp(s}(X*)) = 


{p)Y 


Pi 


Hence, from (EC.12), we conclude that Harp(/i(X)s)(X*)) < M^T*(1 —p'‘j)lp]- 
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Proof of Proposition^ Consider the Lagrangian relaxation 


max mm 

o;^0,AGIR p^!>0 


“ I 1 ^ ^ I ^ 


i=i 




(EC.13) 


.i=i 


= max —a > , max < — 

a> 0 ,AGR 

i=l 


= max —a 
a> 0 ,AGR 


^Pl. 

i=i 

P 

'Epk 


(j+Xp) P] 


^ -ar?*-A 

P}>o I a VKj. 


i=i 


io + ^ 


a 


— ap'’ — A 


In the particular case that a* = 0, the optimal value of (EC.13) is the same as 


max mm 
AgK p^>0 






. 1=1 


whose inner minimization is equivalent to miUpigpi Yl^=iP]^j ~ iiii%G{i,...,nq ^j- Among all solutions 
that lead to this objective value, we find the one that solves 


mm 


(EC.14) 


Now note that by the convexity of (f, for any '^j^j^iP) = 1, we have 

P; \ Pi,j P] \ . ^ , f 1 




I] Ar Y. ^ 


3^ r-cMi ■,c^A^ ‘^r&M^Pb,r \Pb,j / \l^jeMiPb,j 


> 


Y pIj 


(EC.15) 


It is easy to see that choosing p! in (EC. 14) as q] depicted in (25) archives the lower bound in 


(EC.15D, hence concluding the proposition. 


Proof of Proposition^ Consider the Lagrangian for the optimization (21) 


p™™. I] + ^\ Yp>^^- p' 


(EC.16) 


1=1 \l=i P^’^ 

By Theorem 1, P.220 in jLuenberger (1969), suppose that one can find a* > 0 such that q* = 


ni £ 'Pn.i niinimizes (EC.16) for a = a* and moreover that a* ( YAj=i “t— i?* ) = 0, 


then q® is optimal for (21). 


Suppose a* =0, then the minimizer of (EC.16) can be any probability distributions that have 


masses concentrated on the set of indices in A4b Any one of these distributions that lies in W 
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will be an optimal solution to (21). To check whether any of them lies in W, consider the one 


that has the minimum d^{q^,pl) and see whether it is less than or equal to ryh In other words, 
we want to hnd minpi_^g^i.^ j)- The optimal solution to this mini¬ 
mization is Vb t j G which gives an optimal value — logX],cAi»Pb u Thus, if 


~^osYjeM^Pb j — optimal solution q* to (21) given by ( [26) ). 

In the case that a* = 0 does not lead to an optimal solution, or equivalently — log Pb j > V\ 


we consider a* > 0. We write the objective value of (EC. 16) with a = a* as 


i=i i=i 


^ a* Vj'’ 


By Jensen’s inequality. 


(EC.17) 


E 




1=1 


giving 


XI ^oP) + «* X P) ^ > -a* log X pI,. 




1=1 1=1 

It is easy to verify that putting p] as 


Pb,j 


1=1 


Q = 


Pl^e ‘■1 




E n^ A 

r=lPb,re 


-ir/c 


(EC.18) 


gives the lower bound in (EC. 18). Thus g* minimizes (EC. 17). Moreover, a* > 0 can be chosen such 
that 

E^; =1 

Pb,j a*Yj=iPb,je j=i 


Letting /3 = — 1/a*, we obtain (27) and (28). Note that (28) must bear a negative root because 


the left hand side of (28) is 0 when /? = 0, and as /3 —l —oo, we have v^|(,d) = + 

^ogYjeM^Pb j + 0{e^^^) for some positive constant Ci, and (p\'(P) = YjeM^ T 0(e°^^) for some 
positive constant C 2 , and so (5(p'^^{j3) — = — fogZ]jGiv(*Pb j ^ ^ jg nega¬ 

tive enough. 
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Proof of Theorem\^ The proof is an adaptation of Blum (1954). Recall that = vec(p^ : i = 


1,... ,m) where we write each component of p^ as ply Let N = be the total counts of 

support points. Since h(X) is bounded a.s., we have |h(X)| < M a.s. for some M. Without loss 
of generality, we assume that Z{p) > 0 for all p. Also note that Z(p), as a high-dimensional 
polynomial, is continuous everywhere in U. 

For notational convenience, we write d^. = q(pfc) — Pfc and = q(pfc) — Pt, i.e. d^ is the A:-th 
step best feasible direction given the exact gradient estimate, and d^ is the one with estimated 
gradient. 

Now, given pt, consider the iterative update Pfc+i = (1 — efc)pfe -|- efeq(pfe) = Pfe + Cfedfe. We have, 
by Taylor series expansion, 


Z (Pfe+i) — Z (pfc) + CfcV Z (pfe)'dfe -|- 


+ ^fcefcdfe)dfc 


for some 9k between 0 and 1. By Theorem we can rewrite the above as 


Z(pfe+i) = Z{pk) + ekipiPkYdk + yd^V^Z(pfc -h 9kekdk)dk 


(EC.19) 


Consider the second term in the right hand side of (EC. 19). We can write 


^(Pfc)'dfc = -ipiPkYdk + (-0(Pfc) - -0(pfc))'dfc 

< -0(pfc)'dfc -I- (-0(pfc) - -0(pfc))'dfc by the dehnition of d^ 

= ^(pfc)'dfc -I- (ipiPk) - '0(pfc))'dfc -h (^A(Pfc) - 'ip{pk)ydk 
= ^(pfc)'dfc -i- ('0(pfc) - '0(pfc))'(dfe - dfe) (EC.20) 


Hence (EC. 19) and (EC.20) together imply 


Z{pk+i) < Z{pk) + efc-0(pfe)'dfe + efc(-0(pfe) - -j/)(pfe))'(dfc - d^) -h yd(,V^X(pfe + 9k€kdk)dk 


Let Tk be the filtration generated by pi,... ,Pfc. We then have 


E[Z{pk+i)\J^k] < Z{pk) + ek'ip{Pkydk + ekE[{'4){pk)-'ip{pk)y{dk-dk)\Ek] 

+ |E[d'fcV"Z(pfc T 9k€kdk)dk\Ek] 


(EC.21) 
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We analyze (EC.21) term by term. First, since 2’(p) is a high-dimensional polynomial and U is 


a bounded set, the largest eigenvalue of the Hessian matrix V^Z(p), for any p gZY, is uniformly 
bounded by a constant H >0. Hence 


E[dy^Z{p, + J-fc] < HE[\\dkr\Ek] < E < oo 


(EC.22) 


for some E > 0. Now 


E^ipiPk) - rpiPk))'{dk - dk)\Ek] 


(EC.23) 


< JE[\\ip{pk) - ip{pk)\\‘^\J^k]E[\\dk - dkW^lEk] by Cauchy-Schwarz inequality 


< \JE[\\ip{pk) - ip{pkW\Ek]E[2{\\dk\\^ + ||dfc||2)|J'fc] by parallelogram law 


< ■\/8mE;[||’0(pfe)-•0(pfc)p|J’fc] since ||dfc||M|dfef < 2m by using the fact that Pfc, q(pfc),q(pfc) G (P 

by Lemma 


< 




SmM^T 1 - Pi,. 


R, 


< M, 




8mTN 


y Rkminijpl j 

Note that by iterating the update rule (1 — efe)pfc -|- e^q^, we have 


k-1 

minpl - > JJ(l-e,)(5 

j=i 


where 6 = j > 0. We thus have (EC.24) less than or equal to 

-k-l 


8mTN j-r 

Mxl „„ ]_|(1 


SRk 


- e. 


\-l/2 


i=i 


Therefore, noting that '0(pfc)'dfc < 0 by the definition of d*,, from (EC.21) we have 

-k-l 


E[z(pw.) - 


+ ■ 


elV 


i=i 


and hence 


fc=i 


k-l 


(EC.24) 


(EC.25) 


(EC.26) 


elV 


Y, E[E[Z{Pk+i) - Z{pk)\RkV] < E n(i - + E ^ 

k — 1 7 * — 1 k — 1 


By Assumptionsandand Lemma EC.l (depicted after this proof), we have Z{pk) converge to 
an integrable random variable. 
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Now take expectation on (EC.21) further to get 


E[Z{pk+i)] < E[Z{pk)] + efc£^[i/)(pfc)'dj + ekE[{ip{pk) - -0(pfc))'(dfc - dfc)] 


+ -^E[d'i.V'^Z(pf^ + 0fcefcdfc)dfc] 


and telescope to get 


E[Z{pk+i)] < E[Z(pi)] + ^e,E;['0(p,)'d,] + ^ejE[(V’(Pj) -d, 


i=i 


1=1 


k ^2 


+ ^^E[d'V^Z{p,+e,e,d,)d, 


(EC.27) 


1=1 


Now take the limit on both sides of (EC.27). Note that E[Z{pk+i)] E[Zao] for some integrable 


Zoo by dominated convergence theorem. Also Z(pi) < oo, and by (EC.22) and (EC.25) respectively, 
we have 


k ^2 


OO 2 


e E 


hm ^E[d;.VZ(p, + 0,e,d,)d,] < ^ < 


k^oo ^^ 2 
1=1 


OO 


1=1 


and 


lim ^ejE[(i/>(pj) -iA(p,))'(d, -d,)] < mJ 

K—>-00 ^ ^ V 


8mTN^ e ^ ^ 




1=1 's/ i=i 


OO 


Therefore, from (EC.27), and since E[’0(pj)'dj] <0, we must have X]j=i converges 

a.s., which implies that limsup;._,,j,o>-^['*/’(Pfc)^‘ife] = 0. So there exists a subsequence ki such that 
limi_>oo E[’j/>(pfeJ'dfcJ = 0. This in turn implies that 'i/i(pfcj'dfc^ A- 0. Then, there exists a further 
subsequence li such that 'i/>(pij'd;. —I 0 a.s.. 

Consider part [^of the theorem. Let S'* = {p G "P : ^((p) = 0}. Since g{-) is continuous, we have 
Zl(p;.,S*) —)• 0 a.s.. Since Z(-) is continuous, we have D{Z{pi.),Z*) —i 0 a.s.. But since we have 
proven that Z{pk) converges a.s., we have D{Z{pk),Z*) —I 0 a.s.. This gives part[^of the theorem. 

Now consider part By Assumption]^ since p* is the only p such that ^(p) = 0 and g{-) is 
continuous, we must have p/. —)• p* a.s.. Since Z(-) is continuous, we have Z(piJ —i Z(p*). But 
since Z(pk) converges a.s. as shown above, we must have Z(pk) —> Z(p*). Then by Assumption]^ 
again, since p* is the unique optimizer, we have p^ —)• p* a.s.. This concludes part ]^ of the theorem. 
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Lemma EC.l (Adapted from Blum (1954)). Consider a sequence of integrable random vari¬ 
able Yk,k = 1,2, _ Let be the filtration generated by Yi,... ,Yk. Assume 


Y,E[E[Yk+i-Yk\Fu\^]<^ 

fc=i 

where denotes the positive part of x, i.e. = x if x >0 and 0 if x < 0. Moreover, assume 
that Yfc is bounded uniformly from above. Then Y^ —)• Yoo o.-s., where Y^o is an integrable random 
variable. 


The lemma follows from Blum (1954), with the additional conclusion that is integrable, which 


is a direct consequence of the martingale convergence theorem. 


Theorem EC.l (Conditions in Theorem]^. Conditions 1-8 needed in Theorem^are: 

1 . 

f AKMTm KLd\ 
feo > 2a 1 --h 




cr / 


2 . 


3. 


2KL'd lagK \ 2aKL'dQ g 2Kn^ 


2ALi9 2Kn 
-1—ww < 1 


4. 


ko > 


ap 

p-1 


5. 


fi> pa+ 2j +2 


6. 


*0-1 

nil-'.)"' 

i=i 

*0-1 

+n (1 “ c) 

i=i 


M^TN 1 

Td^b il3-pa-l){ko-iy-^ 

_i/ 2 ^ 18mTN 1 

g\ 5b ({fi - pa)/2-'y-l){kQ- 


■i)/3/2-7-i 


< e 
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7. K > 0 is a constant such that |x'V^Z(p)y| < iCHxH||y|| for any gM” and pG^ (which 
must exist because Z(-) is a polynomial defined over a hounded set). 

8. 5 = mini=i,...,m 

Proof of Theorem\^ We adopt the notation as in the proof of Theorem In addition, for 
convenience, we write xpk = ^(Pfc), '•Pk = 'ipiPk), (Ik = q(Pfc), q/c = q(Pfe), Qk = g{Pk) = --ipiPkYdk, 
VZk = VZ(pfc), and V^Zk = V^Z(pfc). Note that Pk+i = Pk + (kdk- 

First, by the proof of Theorem]^ given any v and kg, almost surely there must exists a ko>ko 
such that pkQ < v. If the optimal solution is reached and is kept there, then Pk = 0 from thereon 
and the algorithm reaches and remains at optimum at finite time, hence there is nothing to prove. 
So let us assume that 0 < gk^ < Moreover, let us assume that v is chosen small enough so that 
for any p with g{p) < v and p > 0, we have '0(p) G A/a-,?('*/’(p*)) (which can be done since is 
assumed continuous by Assumption and 'ipip) is continuous for any p > 0 by the construction in 
Theorem . 

We consider the event 

OO OO 

U U Si 

k—kQ k—kQ 

where 

Sk = {\\'>Pk-'ipk\\ >'(9} 


and 


£'k = {\{'4’k-'4’ky{dk-dk)\> 


Note that by Markov inequality, 


P{Sk) < 


EW'tPk-ipkW^M^T^l-pl^ ^ M^TN 


k-l 


^2 


^^Rk 


E 




< 


Pk, 


^^RkS -• , 


11(1 


where the second inequality follows from Lemma and the last inequality follows as in the deriva¬ 


tion in (EC.24) and (EC.25). On the other hand, we have 


< k^E\{'ipk-'il’ky{dk-dk)\ ^ c 


-k-l 


i=i 
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by following the derivation in (EC.24) and (EC.25). Therefore, 




k—kQ 


k—kQ 
k-1 


M^TN ^ 1 , M ISmTN ^ ^-r^ , ,,, 

<—E n(-^ 7 V—£ 71S 




k=ko j=l 


k-1 


kQ-1 

i=i 


oo - k — 1 


-1 


W^TN 

^ Rk - 

k=ko j=kQ 


k=ko ■ - j = 

kQ-1 


E^n(i-‘.)-+n(‘-'.) 


- 1/2 


M l8mTN A ^ 


i=i 


(EC.28) 


Now recall that efc = a/k. Using the fact that 1 — x > e for any 0 < x < {p — 1)/p and p > 1, we 
have, for any 

“ < P~^ 

k ~ p 

or equivalently 


k> 


ap 

p-1 


we have 


1 - Cfc = 1 - - > 

k 


Hence choosing ko satisfying Condition 4, we get 


fe-i 


3=kQ 


k-1 
k(\ — \ 


(EC.29) 


Therefore, picking R^. = bkP and using (EC.29), we have (EC.28) bounded from above by 


-1 


kQ-l 

IK'-v 

3 = 1 
ko-1 

i=i 

kQ-l 

+ n (1 “c) 

3 = 1 




feo —1 


(ko — l)p°‘kP~P°- 

k=ko '' ^ ’ i=l 


n (1 - 


1/2 M ISmTN ^ 1 

’ Tv Sb ^ (feo - 


M^TN 


1 


i(3-pa-l){ko-iy- 


-1/2: 


M / 8mT N 


Q \ 5b {i/3-pa)/2-'y-l){ko-iy/'^-"<- 


(EC.30) 


1/2 


if Condition 5 holds. Then Condition 6 guarantees that P{£) < £■ 
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The rest of the proof will show that under the event we mnst have the bound (29), hence 


concluding the theorem. To this end, we first set up a recursive representation of gk- Consider 

9k+i = —'0fc+idfc+i = —'0^i(qfe+i — Pfe+i) 

= -V’Kqfc+i - Pfc+i) + - i/’fc+i)'(qfe+i - Pfe+i) 

= -^l(qfc+i - Pfc) + -il^'kiPk+i - Pk) + i'll’k - ■0fc+i)'(qfc+i - Pfe+i) 

<gk + ek'tpk^k + {tpk - lAfc+O'dfc+i by the definition of gk, d*, and d^+i 


<gk- ekgk + eki'il’k - '0fe)'(dfc - dfe) + (-0^ - 0fc+i)'dfc+i by ( |EC.20 ) 

= (1 “ Gk)gk + i^Zk — ^Zk+i)'^k+1 + efe(0fc ~ 0fc)^(dfe — dfc) (EC.31) 

Now since VZ(-) is continuously differentiable, we have VZk+i = + efeV^Z(pfc + 6kdk)<ik for 


some 9k between 0 and 1. Therefore (EC.31) is equal to 


(1 - ek)gk - efcdfcV^Z(pfc + 0fcdfc)dfc+i + efc(0fc - 0fc)'(dfc - d^) 

< (l-efe) 5fe + efcK||dfe||||dfe+i||+efc(0fc-0fc)'(dfc-dfc) by Condition 7 

< (1 ~ ^k)gk + efc.fil||dfe|| ||dfe+i|| + ekK\\dk — d^H ||dfe+i|| + Cki'ipk — 'i/^kYidk — dfc) 

by triangle inequality 

< (1 ~ ^k)gk + ^kK —77-j—^jr^ -^ + efcErL||0fc — 0fe|| “[WW"-[7 + ^k{'>Pk ~ 0fe)^(dfc — dfe) 

C^||0fc||||0fe+l|| C||0fc+i|| 

by using Assumptionwith the fact that gk<k> and hence 0^,0^ £ A/a(0(p*)), and also 

Assumption The fact gk<i^ will be proved later by induction. 

K KL - .... 

< (l~efe)5'fe + efc , ,ff/t<7fe+i + Cfe-||0fc ~ 0/c||5fe+i + €fc('0fc ~ 0fe)^(dfe — d^,) (EC.32) 

CT 

by Assumption]^ 


Now under the event and noting that e = a/k, (EC.32) implies that 


a\ aK aKL'd ag 

gk+i <[i-j:)9k + + -^9k+i + ^ 


or 


aK aKLi9\ / a\ ag 
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We claim that l^fcl = \ 'ip'f.dk \ < AMTm, which can be seen by writing 

J{Xl = yi) 




V^;.(p) = ^;p[Mx)s}(x*)] = j;i?p 


h{x)-- 


p-i 


-ri?p[h(x)] 


-^E^[h{X)\X, = y;]-TE^[h{X) 


(EC.33) 


t=i 


so that |V^*(p)| < 2MT* for any p and i. Using this and the fact that 1/(1 — x) < 1 + 2x for any 
0 < X < 1 / 2 , we have, for 

iaKMTm. aKM _ I 


we must have 


2aK 2aKL'&\ 


Pk+i < 1 + -Trrrdk + 


crk ) 


^ — T I 9k + 


ag 


k/ 


(EC.35) 


Note that (EC.34) holds if 


, UKMTm KM 
k>2a\ --^- 




CT 


which is Condition 1 in the theorem. Now (EC.35) can be written as 


a 2aKL'd 2a^Kg \ ag 2a^KL'&g 2a^KM 2aK a 
^ —I-;-1 ;;—„ . „ ,— Qk + -1-rw:- —Qk 


k crk c^T‘^k‘^+"i j 9k+ + g.j-^ 2+7 

, a 2aKM 2a'^ 

< 1 -V + 


crfc^ c^r'^k I k) 


2a?Kg \ ag 2a?KL'dg 2aK / a\ 
k ' crk c^T‘^k‘^+^ j9k + + ^.^.^ 2+7 £. 2 .^ 2 /, \ 


9 I 


(EC.36) 


We argue that under Condition 2, we must have gkl^i^ for all k>ko. This can be seen by induction 


using (EC.36). By our setting at the beginning of this proof we have Qkg <y- Suppose gk<v for 
some k. We then have 


gk+i < 1 — 7 + 


a 2aKM 2a'^Kg 


+ 


k crk c^T‘^k‘^+'^ 


v + 


ag 2a?KMg 2aK 


+ 


+ 


E+'l CTk^+'f C^T^k 


1 - ^ W 


2aKg ^ ^ Q ^2aKL'&g ^2Kiy^\ 


a f f 2KM 

CT j ^ ' k2 ' crkl'^'^ ' c^t'^ 


< V 


(EC.37) 


by Condition 2. This concludes our claim. 
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Given that for all k>kQ, (EC.35) implies that 


a ( 2KM\ 2a'^KM 2aKu f a\\ ag a?Q (2Kv 2KM\ 
“ k \ CT ] crk'^ V fc/ j k''-+'^ k‘^+^ V cr J 


CT J crk^ 

'■) 


a f 2KM 2Kv\ 

A: V CT C^t'^ j j fffc fo + ^2+7 ( ^2^2 


c'^T^k V k/ J A:i+''' k'^+"/ \ cr 

ag a?g f 2Ku 2KM\ 


+ 


. C\ G 


CT 


(EC.38) 


where 


and 


C = a 1- 


2KM 2Kv\ 


^ a^g {2 Ktt 2KM\ 


Now note that Condition 3 implies that C > 0. By recursing the relation (EC.38), we get 
X { C\ ^ ^ ( C\ G 

9k+i <n(^“ 7 ) 5 fco+zi n 

j=ko ^ j=koi=j+l \ 


j=ko •’ 


=-ce7+i1A_^ 


< 


if 0 < 7 < C 


(*fl) »« + (' + 7 ) 


(C-7)(fc+l)T' 

,-m-cvfc+i)C if7>C 


(7-C’)(/co- 
log(fc/(fco-l)) 

(Mup 


if 7 = c 


which gives ([29j). This concludes the proof. 

Proof of Corollary^ We use the notations in the proof of TheoremOur analysis starts from 


(EC. 19), namely 


Zk+i — Zk + (p^, + 6kek(ik)<^k 

for some ¥ between 0 and 1. Using the fact that tp'kdk > tl’kd-k by the definition of d^, we have 


Zk+i P Zk + ek'ipkd-k T (p^, + 9k€kdk)dk 


— Zk — CkSk P T^d'jfSJ Z(pk P 9kCkdk)dk 
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Now, using (29) and Condition 7 in Theorem]^ we have 

( 

A 


^fe+i 






+ Bx < 


V 


^ aA „ , 


if 0 < 7 < C 
-ip-ckc if 7 > 


\ 


(l-C)(ko 

log((fc-l)/(fcO-l)) ^ 

ic-^]ki+^f if 0 < 7 < C 

(7-C)(feo-i)^“‘"7i+c' if 7 > C* 

log((fc-l)/(fco-l)) ry — (J 




/ 


- 




(EC.39) 


Now iterating (EC.39) from k to I, we have 


i-i 


aA 


E]=l 3^ if 0 < 7 < C 

Ei=l if 7 > <^ 




i=fe 


(7-C)(/co-l)^-C 

log(0-l)/(fco-l)) 


1 

>-^E-= 


and letting / —)> oo, we get 


OO . 

i=fc 


1 _ 1 

(C-7) ^i=fc ji+r 


if 7 = C 


if 0 < 7 < C 


3=k 




■Sipoo log((i-l)/(fco-l)) 
2-^j—k A+C 


EE.7i^if7>C 

if 7 = C 


- 




OO 

Ef 


2^7 

i=fe 


77 (EC.40) 


where the convergence to Z* is guaranteed by Theorem]^ Note that (EC.40) implies that 

if 0 < 7 < C 


Z*>Zk- 


aA 


C{k-l)(^ 


— aB X < 


(C-7)7(fc-l)T' 
1 


{■y-C){kQ-l)'<-Cc(k-l)C if 7 > C* 

if 7 = c 


- 


log((fc-l)/(fco-l)) 

C(fe-l)C 


a^K 

2{k-l) 


> Zh — 


D 


E 


if 0 < 7 < C* 


k-l [k-iy 


- F X < 


(C-7)7(fc-l)7 

(7-c)(feo-i)^“'^C'(fc-i)C if 7 > C 


log((fc-l)/(fco-l)) 

C(fc-l)C 


if 7 = C7 


where D = a^K/2, E = aA/C and F = aB. This gives (31). 

Proof of Lemma\^ Consider first a fixed a. When a(l — w) > 1, (32) reduces to ^ EPi' 

Since is increasing in /3 and is decreasing in /3, the maximizer of A occurs 
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at the intersection of ^ 2 (^+ 1 ) ^ which is /3 = pa + C + 4. The associated value of ([3^ 


/ 3+1 


IS 


pa+C+5■ 

When a(l —a;)<l, (32) reduces to A By a similar argument, the maximizer is 

/3 = a(2 — 2w + p) + C + 2, with the value of (32) equal to q( 2 - 2 LV^+c +3 - 

Thus, overall, given a, the optimal choice of /3 is /3 = pa + C + 2 + 2((a(l — w)) A 1), with the 
value of (@ given by ^,^^^^ 3 | 27 (a?r-l))Ai) ■ ^hen a(l - w) > 1 , the value of (@ is which is 


decreasing in a, whereas when a(l — w) < 1, the value of (32) is a(2-l7Tp)+C+3 increasing 


in a. Thus the maximum occurs when a(l — w) = 1, or a = The associated value of (32) is 


p/{l-ui)+C.+b- 

Remark EC.l. Suppose that Assumptionis replaced by letting 

l|v(6) - v(6)ll < A||6 -611 


hold for any ^ 1,^2 £ 1^^- Then, in the proof of Theorem]^ the inequality (EC.l) can be replaced 
by 


< — E[\\il)k - -0fep]E[||dfc - dfc|| 2 ] by Cauchy-Schwarz inequality 
Q 


< 


k^L 


E[||’0fe — i/’fclfl by the relaxed Assumption]^ 


fc-i 


LM^TNk'^ T-r, 

< - n<‘ 


RkgS 


- e,- 


by following the derivation in (EC.24) and (EC.25) 




Consequently, equation (EC.30) becomes 


feo-i 

i=i 


^M^TN 


1 


+ ■ 


L 


5b pa-l)iko-iy-^ p(/3-7-pa-l)(fco-l)'5-'^-i 

if Condition 5 is replaced by 

/3 > 7 + pa + 1 


Correspondingly, Condition 6 needs to be replaced by 


fco-i 


i=i 


\-i 


M^TN 


1 


+ 


L 


5b \'d‘^{j3 — pa — l){kQ — lY~^ p(/5 — 7 — pa — l)(/co — 1 )^“^ 


< e 
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The results in Theorem and Corollary then retain. Under these modified Conditions 5 and 
6 , discussion point 3(b) in Section |6.2| then gives P = 'j +pa+ 1 + C for some C > 0 and 7 = 
P — pa — ( — 1. In discussion point 4, the convergence rate in terms of replications becomes 
“))^(/3—p“—C—i)^i)/(4+i) ^ By maximizing 


(a(l — a;)) A {(3 — pa— ( — 1) Al 


(EC.41) 


like in (32) by Lemma(see Lemma EC.2 right after this remark), we get 


a —— T~^ -l-C + 2 

1 — UJ 1 — OJ 


and the optimal value is 


1 


/ 3 /(l — a;) + C + 3 

So, following the argument there, we choose d and v, and hence w, to be small, and we choose p 
to be close to 1. This gives rise to the approximate choice that a ss 1 + w and /? 3 + C + w- The 


convergence rate is then 0{W U(4+c+“)^^ leading to our claim in Section 6.2 that the complexity 
can improve to 0(l/e^''"‘’''‘‘^) if Assumption is relaxed. 


Lemma EC.2. The maximizer of (EC.41) is given hy 


a—-, P — -l-C + 2 

1 — OJ 1 — CJ 


and the optimal value is 


/3/(l — a;) + C + 3 


Proof of Lemma EC.2 Consider first a fixed a. When a(l — w) > 1, (EC.41) reduces to 


^ ^ A Since ^ ^ is increasing in /3 and is decreasing in /3, the maximizer of 


^ p°- ^ A occurs at the intersection of ^ ^ and ^ 7 )-, which is fi = pa + f + 2. The asso- 


/ 3+1 


/ 3+1 


/ 3 + 1 ’ 


ciated value of (EC.41) is 
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When a(l — cj) < 1, ( |EC.41[ ) reduces to A ^ By a similar argument, the maximizer 


is /3 = a(l — w + /o) + C + 1, with the value of (EC.41) equal to . 

Thus, overall, given a, the optimal choice of ,0 is /3 = pa + C + 1 + (a(l — oj)) A 1, with the value 
of ( |EC.41D given by ■ When a(l - a;) > 1, the value of ( |EC.4lD is which is 


decreasing in a, whereas when a(l — a;) < 1, the value of (32) is which is increasing 


in a. Thus the maximum occurs when a(l — a;) = 1, or a = The associated value of (EC.41) is 


p/(l—cj)+C+3 
















